[57] ABSTRACT 

An array processor includes processing elements arranged in 
clusters which are, in turn, combined in a rectangular array. 
Each cluster is formed of processing elements which pref- 
erably communicate with the processing elements of at least 
two other clusters. Additionally each inter-cluster commu- 
nication path is mutually exclusive, that is, each path carries 
either north and west, south and east, north and east, or south 
and west communications. Due to the mutual exclusivity of 
the data paths, communications between the processing 
elements of each cluster may be combined in a single 
inter-cluster path. That is, communications from a cluster 
which communicates to the north and east with another 
cluster may be combined in one path, thus eliminating half 
the wiring required for the path. Additionally, the length of 
the longest communication path is not directly determined 
by the overall dimension of the array, as it is in conventional 
torus arrays. Rather, the longest communications path is 
limited only by the inter-cluster spacing. In one 
implementation, transpose elements of an NxN torus are 
combined in clusters and communicate with one another 
through intra-cluster communications paths. Since transpose 
elements have direct connections to one another, transpose 
operation latency is eliminated in this approach. 
Additionally, each PE may have a single transmit port and 
a single receive port. As a result, the individual PEs are 
decoupled from the topology of the array. 

28 Claims, 33 Drawing Sheets 
MANIFOLD ARRAY PROCESSOR 
BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to processing systems in 
general and, more specifically, to parallel processing archi- 
tectures. 

2. Description of the Related Art 

Many computing tasks can be developed that operate in 
parallel on data. The efficiency of the parallel processor 
depends upon the parallel processor's architecture, the 
coded algorithms, and the placement of data in the parallel 
elements. For example, image processing, pattern 
recognition, and computer graphics are all applications 
which operate on data that is naturally arranged in two- or 
three-dimensional grids. The data may represent a wide 
variety of signals, such as audio, video, SONAR or RADAR 
signals, by way of example. Because operations such as 
discrete cosine transforms (DCT), inverse discrete cosine 
transforms (IDCT), convolutions, and the like which are 
commonly performed on such data may be performed upon 
different grid segments simultaneously, multiprocessor array 
systems have been developed which, by allowing more than 
one processor to work on the task at one time, may signifi- 
cantly accelerate such operations. Parallel processing is the 
subject of a large number of patents including U.S. Pat. Nos. 
5,065,339; 5,146,543; 5,146,420; 5,148,515; 5,546,336; 
5,542,026; 5,612,908 and 5,577,262; European Published 
Application Nos. 0,726,529 and 0,726,532 which are hereby 
incorporated by reference. 

One conventional approach to parallel processing archi- 
tectures is the nearest neighbor mesh connected computer, 
which is discussed in R. Cypher and J. L. C. Sanz, SIMD 
Architectures and Algorithms for Image Processing and 
Computer Vision, IEEE Transactions on Acoustics, Speech 
and Signal Processing, Vol. 37, No. 12, pp. 2158-2174, 
December 1989; K. E. Batcher, Design of a Massively 
Parallel Processor, IEEE Transactions on Computers, Vol. 
C-29 No. 9, pp. 836-840 September 1980; and L. Uhr, 
Multi-Computer Architectures for Artificial Intelligence, 
New York, N.Y., John Wiley & Sons, Ch. 8, p. 97, 1987. 

In the nearest neighbor torus connected computer of FIG. 
1A multiple processing elements (PEs) are connected to 
their north, south, east and west neighbor PEs through torus 
connection paths MP and all PEs are operated in a synchro- 
nous single instruction multiple data (SIMD) fashion. Since 
a torus connected computer may be obtained by adding 
wraparound connections to a mesh-connected computer, a 
mesh-connected computer, one without wraparound 
connections, may be thought of as a subset of torus connected 
computers. As illustrated in FIG. 1B, each path MP 
may include T transmit wires and R receive wires, or as 
illustrated in FIG. 1C, each path MP may include B bidi- 
rectional wires. Although unidirectional and bidirectional 
communications are both contemplated by the invention, the 
total number of bus wires, excluding control signals, in a 
path will generally be referred to as k wires hereinafter, 
where k=B in a bidirectional bus design and k=T+R in a 
unidirectional bus design. It is assumed that a PE can 
transmit data to any of its neighboring PEs, but only one at 
a time. For example, each PE can transmit data to its east 
neighbor in one communication cycle. It is also assumed that 
a broadcast mechanism is present such that data and instruc- 
tions can be dispatched from a controller simultaneously to 
all PEs in one broadcast dispatch period. 

Although bit-serial inter-PE communications are typically 
employed to minimize wiring complexity, the wiring complexity 
of a torus-connected array nevertheless presents 
implementation problems. The conventional torus-connected 
array of FIG. 1A includes sixteen processing elements 
connected in a four by four array 10 of PEs. Each processing 
element PEi,j is labeled with its row and column number i and 
j, respectively. Each PE communicates to its nearest North 
(N), South (S), East (E) and West (W) neighbor with point to 
point connections. For example, the connection between PE0,0 
and PE3,0 shown in FIG. 1A is a wraparound connection between 
PE0,0's N interface and PE3,0's south interface, representing 
one of the wraparound interfaces that forms the array into a 
torus configuration. In such a configuration, each row 
contains a set of N interconnections and, with N rows, there 
are N^2 horizontal connections. 

Similarly, with N columns having N vertical interconnections 
each, there are N^2 vertical interconnections. For the 
example of FIG. 1A, N=4. The total number of wires, such as 
the metallization lines in an integrated circuit 
implementation in an NxN torus-connected computer including 
wraparound connections, is therefore 2kN^2, where k is the 
number of wires in each interconnection. The number k may be 
equal to one in a bit serial interconnection. For example, 
with k=1 for the 4x4 array 10 as shown in FIG. 1A, 2kN^2=32. 
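
The wire count above can be checked with a short sketch. The function name `torus_wire_count` is an illustrative assumption and is not part of the original disclosure; it simply evaluates 2kN^2 by counting the N^2 horizontal and N^2 vertical connections separately.

```python
def torus_wire_count(n: int, k: int = 1) -> int:
    """Total inter-PE wires in an n x n torus, wraparound included.

    n -- number of rows (and columns) of PEs
    k -- wires per interconnection (k = 1 for a bit-serial link)
    """
    horizontal = n * n  # n rows, each with n connections
    vertical = n * n    # n columns, each with n connections
    return k * (horizontal + vertical)


print(torus_wire_count(4, k=1))  # 32, the 4x4 bit-serial example
```

Because the count grows with N^2, doubling the array dimension quadruples the wiring under this model.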

For a number of applications where N is relatively small, 
it is preferable that the entire PE array is incorporated in a 
single integrated circuit. The invention does not preclude 
implementations where each PE can be a separate 
microprocessor chip, for example. Since the total number of 
wires in a torus connected computer can be significant, the 
interconnections may consume a great deal of valuable 
integrated circuit "real estate", or the area of the chip 
taken up. Additionally, the PE interconnection paths quite 
frequently cross over one another, complicating the IC layout 
process and possibly introducing noise to the communications 
lines through crosstalk. Furthermore, the length of 
wraparound links, which connect PEs at the North and South 
and at the East and West extremes of the array, increases with 
increasing array size. This increased length increases each 
communication line's capacitance, thereby reducing the line's 
maximum bit rate and introducing additional noise to the line. 

Another disadvantage of the torus array arises in the 
context of transpose operations. Since a processing element 
and its transpose are separated by one or more intervening 
processing elements in the communications path, latency is 
introduced in operations which employ transposes. For 
example, should PE2,1 require data from its transpose, PE1,2, 
the data must travel through the intervening PE1,1 or PE2,2. 
Naturally, this introduces a delay into the operation, even 
if PE1,1 and PE2,2 are not otherwise occupied. However, in 
the general case where the PEs are implemented as 
microprocessor elements, there is a very good probability 
that PE1,1 and PE2,2 will be performing other operations and, 
in order to transfer data or commands from PE1,2 to PE2,1, 
they will have to set aside these operations in an orderly 
fashion. Therefore, it may take several operations to even 
begin transferring the data or commands from PE1,2 to PE1,1, 
and the operations PE1,1 was forced to set aside to transfer 
the transpose data will also be delayed. Such delays snowball 
with every intervening PE, and significant latency is 
introduced for the most distant of the transpose pairs. For 
example, the PE3,1/PE1,3 transpose pair of FIG. 1A has a 
minimum of three intervening PEs, requiring a latency of four 
communication steps, and could additionally incur the latency 
of all the tasks which must be set aside in all those PEs in 
order to transfer data between PE3,1 and PE1,3 in the general 
case. 
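
The minimum step counts quoted above can be reproduced with a small sketch. It assumes only nearest-neighbor moves on an N x N torus; the helper names `torus_hops` and `transpose_hops` are illustrative, not from the original text. It models communication steps only and ignores the additional delay of setting aside work in intervening PEs.

```python
def torus_hops(a, b, n):
    """Minimum nearest-neighbor steps between PEs a and b on an n x n torus."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    # Wraparound links let each dimension be crossed in either direction.
    return min(dr, n - dr) + min(dc, n - dc)


def transpose_hops(i, j, n):
    """Steps needed to move data between PEi,j and its transpose PEj,i."""
    return torus_hops((i, j), (j, i), n)


print(transpose_hops(2, 1, 4))  # 2 steps, i.e. one intervening PE
print(transpose_hops(3, 1, 4))  # 4 steps, i.e. three intervening PEs
```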
Recognizing such limitations of torus connected arrays, 
new approaches to arrays have been disclosed in U.S. Pat. 
No. 5,612,908; A Massively Parallel Diagonal Fold Array 
Processor, G. G. Pechanek et al., 1993 International 
Conference on Application Specific Array Processors, pp. 
140-143, Oct. 25-27, 1993, Venice, Italy, and Multiple Fold 
Clustered Processor Torus Array, G. G. Pechanek, et al., 
Proceedings Fifth NASA Symposium on VLSI Design, pp. 
8.4.1-11, Nov. 4-5, 1993, University of New Mexico, 
Albuquerque, N.M. which are incorporated by reference 
herein in their entirety. The operative technique of these 
torus array organizations is the folding of arrays of PEs 
using the diagonal PEs of the conventional nearest neighbor 
torus as the foldover edge. As illustrated in the array 20 of 
FIG. 2, these techniques may be employed to substantially 
reduce inter-PE wiring, to reduce the number and length of 
wraparound connections, and to position PEs in close 
proximity to their transpose PEs. This processor array 
architecture is disclosed, by way of example, in U.S. Pat. Nos. 
5,577,262, 5,612,908, and EP 0,726,532 and EP 0,726,529 
which were invented by the same inventor as the present 
invention and are incorporated herein by reference in their 
entirety. While such arrays provide substantial benefits over 
the conventional torus architecture, their PE combinations 
are irregular. For example, in a single fold diagonal fold 
mesh, some PEs are clustered "in twos" while others are 
single, and in a three fold diagonal fold mesh there are 
clusters of four PEs and eight PEs. Due to this irregularity 
and to an overall triangular shape of the arrays, the 
diagonal fold type of array presents substantial obstacles to 
efficient, inexpensive integrated circuit implementation. 
Additionally, in a diagonal fold mesh as in EP 0,726,532 and 
EP 0,726,529, and other conventional mesh architectures, the 
interconnection topology is inherently part of the PE 
definition. This fixes the PE's position in the topology, 
consequently limiting the topology of the PEs and their 
connectivity to the fixed configuration that is implemented. 
Thus, a need exists for further improvements in processor 
array architecture and processor interconnection. 



SUMMARY OF THE INVENTION 
The present invention is directed to an array of processing 
elements which substantially reduces the array's 
interconnection wiring requirements when compared to the 
wiring requirements of conventional torus processing element 
arrays. In a preferred embodiment, one array in accordance 
with the present invention achieves a substantial reduction in 
the latency of transpose operations. Additionally, the 
inventive array decouples the length of wraparound wiring from 
the array's overall dimensions, thereby reducing the length 
of the longest interconnection wires. Also, for array 
communication patterns that cause no conflict between the 
communicating PEs, only one transmit port and one receive 
port are required per PE, independent of the number of 
neighborhood connections a particular topology may require 
of its PE nodes. A preferred integrated circuit 
implementation of the array includes a combination of similar 
processing element clusters combined to present a rectangular 
or square outline. The similarity of processing elements, the 
similarity of processing element clusters, and the regularity 
of the array's overall outline make the array particularly 
suitable for cost-effective integrated circuit manufacturing. 

To form an array in accordance with the present invention, 
processing elements may first be combined into clusters 
which capitalize on the communications requirements of 
single instruction multiple data ("SIMD") operations. 
Processing elements may then be grouped so that the elements 
of one cluster communicate within a cluster and with 
members of only two other clusters. Furthermore, each 
cluster's constituent processing elements communicate in 
only two mutually exclusive directions with the processing 
elements of each of the other clusters. By definition, in a 
SIMD torus with unidirectional communication capability, 
the North/South directions are mutually exclusive with the 
East/West directions. Processing element clusters are, as the 
name implies, groups of processors formed preferably in 
close physical proximity to one another. In an integrated 
circuit implementation, for example, the processing elements 
of a cluster preferably would be laid out as close to 
one another as possible, and preferably closer to one another 
than to any other processing element in the array. For 
example, an array corresponding to a conventional four by 
four torus array of processing elements may include four 
clusters of four elements each, with each cluster 
communicating only to the North and East with one other 
cluster and to the South and West with another cluster, or to 
the South and East with one other cluster and to the North 
and West with another cluster. By clustering PEs in this 
manner, communications paths between PE clusters may be 
shared, through multiplexing, thus substantially reducing the 
interconnection wiring required for the array. 

In a preferred embodiment, the PEs comprising a cluster 
are chosen so that processing elements and their transposes 
are located in the same cluster and communicate with one 
another through intra-cluster communications paths, thereby 
eliminating the latency associated with transpose operations 
carried out on conventional torus arrays. Additionally, since 
the conventional wraparound path is treated the same as any 
PE-to-PE path, the longest communications path may be as 
short as the inter-cluster spacing, regardless of the array's 
overall dimension. According to the invention an NxM torus 
may be transformed into an array of M clusters of N PEs, or 
into N clusters of M PEs. 

These and other features, aspects and advantages of the 
invention will be apparent to those skilled in the art from the 
following detailed description, taken together with the 
accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1A is a block diagram of a conventional prior art 4x4 
nearest neighbor connected torus processing element (PE) 
array; 

FIG. 1B illustrates how the prior art torus connection 
paths of FIG. 1A may include T transmit and R receive 
wires; 

FIG. 1C illustrates how prior art torus connection paths of 
FIG. 1A may include B bidirectional wires; 

FIG. 2 is a block diagram of a prior art diagonal folded 
mesh; 

FIG. 3A is a block diagram of a processing element which 
may suitably be employed within the PE array of the present 
invention; 

FIG. 3B is a block diagram of an alternative processing 
element which may suitably be employed within the PE 
array of the present invention; 

FIG. 4 is a tiling of a 4x4 torus which illustrates all the 
torus's inter-PE communications links; 

FIGS. 5A through 5G are tilings of a 4x4 torus which 
illustrate the selection of PEs for cluster groupings in 
accordance with the present invention; 

FIG. 6 is a tiling of a 4x4 torus which illustrates alterna- 
tive grouping of PEs for clusters; 

FIG. 7 is a tiling of a 3x3 torus which illustrates the 
selection of PEs for PE clusters; 
FIG. 8 is a tiling of a 3x5 torus which illustrates the 
selection of PEs for PE clusters; 

FIG. 9 is a block diagram illustrating an alternative, 
rhombus/cylinder approach to selecting PEs for PE clusters; 

FIG. 10 is a block diagram which illustrates the 
inter-cluster communications paths of the new PE clusters; 

FIGS. 11A and 11B illustrate alternative rhombus/ 
cylinder approaches to PE cluster selection; 

FIG. 12 is a block diagram illustration of the rhombus/ 
cylinder PE selection process for a 5x4 PE array; 

FIG. 13 is a block diagram illustration of the rhombus/ 
cylinder PE selection process for a 4x5 PE array; 

FIG. 14 is a block diagram illustration of the rhombus/ 
cylinder PE selection process for a 5x5 PE array; 

FIGS. 15A through 15D are block diagram illustrations of 
inter-cluster communications paths for 3, 4, 5, and 6 cluster 
by 6 PE arrays, respectively; 

FIG. 16 is a block diagram illustrating East/South 
communications paths within an array of four four-member 
clusters; 

FIG. 17 is a block diagram illustration of East/South and 
West/North communications paths within an array of four 
four-member clusters; 

FIG. 18 is a block diagram illustrating one of the clusters 
of the embodiment of FIG. 17, which illustrates in greater 
detail a cluster switch and its interface to the illustrated 
cluster; 

FIGS. 19A and 19B illustrate a convolution window and 
convolution path, respectively, employed in an exemplary 
convolution which may advantageously be carried out on the 
new array processor of the present invention; 

FIGS. 19C and 19D are block diagrams which respectively 
illustrate a portion of an image within a 4x4 block and 
the block loaded into conventional torus locations; and 

FIGS. 20A through 24B are block diagrams which illustrate 
the state of a manifold array in accordance with the 
present invention at the end of each convolution operational 
step. 

DETAILED DESCRIPTION 

In one embodiment, a new array processor in accordance 
with the present invention combines PEs in clusters, or 
groups, such that the elements of one cluster communicate 
with members of only two other clusters and each cluster's 
constituent processing elements communicate in only two 
mutually exclusive directions with the processing elements 
of each of the other clusters. By clustering PEs in this 
manner, communications paths between PE clusters may be 
shared, thus substantially reducing the interconnection 
wiring required for the array. Additionally, each PE may have 
a single transmit port and a single receive port or, in the case 
of a bidirectional sequential or time sliced transmit/receive 
communication implementation, a single transmit/receive 
port. As a result, the individual PEs are decoupled from the 
topology of the array. That is, unlike a conventional torus 
connected array where each PE has four bidirectional 
communication ports, one for communication in each direction, 
PEs employed by the new array architecture need only have 
one port. In implementations which utilize a single transmit 
and a single receive port, all PEs in the array may 
simultaneously transmit and receive. In the conventional 
torus, this would require four transmit and four receive 
ports, a total of eight ports, per PE, while in the present 
invention, one transmit port and one receive port, a total of 
two ports, per PE are required. 
In one presently preferred embodiment, the PEs comprising 
a cluster are chosen so that processing elements and their 
transposes are located in the same cluster and communicate 
with one another through intra-cluster communications 
paths. For convenience of description, processing elements 
are referred to as they would appear in a conventional torus 
array, for example, processing element PE0,0 is the 
processing element that would appear in the "Northwest" 
corner of a conventional torus array. 

Consequently, although the layout of the new cluster array 
is substantially different from that of a conventional array 
processor, the same data would be supplied to corresponding 
processing elements of the conventional torus and new 
cluster arrays. For example, the PE0,0 element of the new 
cluster array would receive the same data to operate on as 
the PE0,0 element of a conventional torus-connected array. 
Additionally, the directions referred to in this description 
will be in reference to the directions of a torus-connected 
array. For example, when communications between process- 
ing elements are said to take place from North to South, 
those directions refer to the direction of communication 
within a conventional torus-connected array. 

The PEs may be single microprocessor chips that may be 
of a simple structure tailored for a specific application. 
Though not limited to the following description, a basic PE 
will be described to demonstrate the concepts involved. The 
basic structure of a PE 30 illustrating one suitable embodi- 
ment which may be utilized for each PE of the new PE array 
of the present invention is illustrated in FIG. 3A. For 
simplicity of illustration, interface logic and buffers are not 
shown. A broadcast instruction bus 31 is connected to 
receive dispatched instructions from a SIMD controller 29, 
and a data bus 32 is connected to receive data from memory 
33 or another data source external to the PE 30. A register 
file storage medium 34 provides source operand data to 
execution units 36. An instruction decoder/controller 38 is 
connected to receive instructions through the broadcast 
instruction bus 31 and to provide control signals 21 to 
registers within the register file 34 which, in turn, provide 
their contents as operands via path 22 to the execution units 
36. The execution units 36 receive control signals 23 from 
the instruction decoder/controller 38 and provide results via 
path 24 to the register file 34. The instruction 
decoder/controller 38 also provides cluster switch enable 
signals on an output line 39 labeled Switch Enable. The 
function of cluster switches will be discussed in greater 
detail below in conjunction with the discussion of FIG. 18. 
Inter-PE communications of data or commands are received at 
receive input 37 labeled Receive and are transmitted from a 
transmit output 35 labeled Send. 

FIG. 3B shows an alternative PE representation 30' that 
includes an interface control unit 50 which provides data 
formatting operations based upon control signals 25 
received from the instruction decoder/controller 38. Data 
formatting operations can include, for example, parallel to 
serial and serial to parallel conversions, data encryption, and 
data format conversions to meet various standards or inter- 
face requirements. 

A conventional 4x4 nearest neighbor torus of PEs of the 
same type as the PE 30 illustrated in FIG. 3A is shown 
surrounded by tilings of itself in FIG. 4. The center 4x4 torus 
40 is encased by a ring 42 which includes the wraparound 
connections of the torus. The tiling of FIG. 4 is a descriptive 
aid used to "flatten out" the wraparound connections and to 
thereby aid in explanation of the preferred cluster forming 
process utilized in the array of one embodiment of the 
present invention. For example, the wraparound connection 
to the west from PE0,0 is PE0,3, that from PE1,3 to the east 
is PE1,0, etc., as illustrated within the block 42. The utility 
of this view will be more apparent in relation to the 
discussion below of FIGS. 5A-5G. 

In FIG. 5A, the basic 4x4 PE torus is once again 
surrounded by tilings of itself. The present invention 
recognizes that communications to the East and South from 
PE0,0 involve PE0,1 and PE1,0, respectively. Furthermore, the 
PE which communicates to the east to PE1,0 is PE1,3 and PE1,3 
communicates to the South to PE2,3. Therefore, combining 
the four PEs, PE0,0, PE1,3, PE2,2, and PE3,1 in one cluster 
yields a cluster 44 from which PEs communicate only to the 
South and East with another cluster 46 which includes PEs, 
PE0,1, PE1,0, PE2,3 and PE3,2. Similarly, the PEs of cluster 46 
communicate to the South and East with the PEs of cluster 
48 which includes PEs, PE0,2, PE1,1, PE2,0, and PE3,3. The 
PEs, PE0,3, PE1,2, PE2,1, and PE3,0 of cluster 50 communicate 
to the South and East with cluster 44. This combination 
yields clusters of PEs which communicate with PEs in only 
two other clusters and which communicate in mutually 
exclusive directions to those clusters. That is, for example, 
the PEs of cluster 48 communicate only to the South and 
East with the PEs of cluster 50 and only to the North and 
West with the PEs of cluster 46. It is this exemplary 
grouping of PEs which permits the inter-PE connections 
within an array in accordance with the present invention to 
be substantially reduced in comparison with the requirements 
of the conventional nearest neighbor torus array. 
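
The connectivity property described for FIG. 5A can be checked mechanically. The sketch below is illustrative only: it takes cluster 44 as PE0,0, PE1,3, PE2,2, PE3,1 (consistent with the transpose pairs the text places in cluster 44), represents each PE as a (row, column) tuple, and uses hypothetical names (`clusters`, `owner`, `targets`) that do not appear in the original.

```python
N = 4

# Cluster membership for FIG. 5A, keyed by reference numeral.
clusters = {
    44: {(0, 0), (1, 3), (2, 2), (3, 1)},
    46: {(0, 1), (1, 0), (2, 3), (3, 2)},
    48: {(0, 2), (1, 1), (2, 0), (3, 3)},
    50: {(0, 3), (1, 2), (2, 1), (3, 0)},
}
owner = {pe: cid for cid, pes in clusters.items() for pe in pes}

# Torus directions as (row delta, column delta); row 0 is the North edge.
SOUTH, EAST, NORTH, WEST = (1, 0), (0, 1), (-1, 0), (0, -1)


def targets(cid, di, dj):
    """Clusters reached when every PE of cluster cid sends in one direction."""
    return {owner[((i + di) % N, (j + dj) % N)] for (i, j) in clusters[cid]}


for cid in clusters:
    se = targets(cid, *SOUTH) | targets(cid, *EAST)
    nw = targets(cid, *NORTH) | targets(cid, *WEST)
    print(cid, "S/E ->", sorted(se), "N/W ->", sorted(nw))
```

Each cluster reaches exactly one cluster to the South and East and exactly one other cluster to the North and West, which is the property that lets the two directions share an inter-cluster path.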

Many other combinations are possible. For example, 
starting again with PE0,0 and grouping PEs in relation to 
communications to the North and East yields clusters 52, 54, 
56 and 58 of FIG. 5B. These clusters may be combined in 
a way which greatly reduces the interconnection requirements 
of the PE array and which reduces the length of the 
longest inter-PE connection. However, these clusters do not 
combine PEs and their transposes as the clusters 44-50 in 
FIG. 5A do. That is, although transpose pairs PE0,2/PE2,0 
and PE1,3/PE3,1 are contained in cluster 56, the transpose 
pair PE0,1/PE1,0 is split between clusters 54 and 58. An array 
in accordance with the presently preferred embodiment 
employs only clusters such as 44-50 which combine all PEs 
with their transposes within clusters. For example, in FIG. 
5A the PE3,1/PE1,3 transpose pair is contained within cluster 
44, the PE3,2/PE2,3 and PE1,0/PE0,1 transpose pairs are 
contained within cluster 46, the PE0,2/PE2,0 transpose pair is 
contained within cluster 48, and the PE3,0/PE0,3 and 
PE2,1/PE1,2 transpose pairs are contained within cluster 50. 
Clusters 60, 62, 64 and 68 of FIG. 5C are formed, starting at 
PE0,0, by combining PEs which communicate to the North 
and West. Note that cluster 60 is equivalent to cluster 44, 
cluster 62 is equivalent to cluster 46, cluster 64 is equivalent 
to cluster 48 and cluster 68 is equivalent to cluster 50. 
Similarly, clusters 70 through 76 of FIG. 5D, formed by 
combining PEs which communicate to the South and West, 
are equivalent to clusters 52 through 58, respectively, of FIG. 
5B. As demonstrated in FIG. 5E, clusters 45, 47, 49 and 51, 
which are equivalent to the preferred clusters 48, 50, 44 and 
46, may be obtained from any "starting point" within the 
torus 40 by combining PEs which communicate to the South 
and East. 
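
The difference between the two groupings with respect to transposes can be illustrated with a brief sketch. It rests on an assumption inferred from the text rather than stated in it: the South/East grouping of FIG. 5A collects PEs with equal (i + j) mod N, and the North/East grouping of FIG. 5B collects PEs with equal (i - j) mod N. All function names are hypothetical.

```python
N = 4


def south_east_cluster_id(i, j):
    # Assumed invariant of the FIG. 5A grouping: (i + j) mod N.
    return (i + j) % N


def north_east_cluster_id(i, j):
    # Assumed invariant of the FIG. 5B grouping: (i - j) mod N.
    return (i - j) % N


def split_transpose_pairs(cluster_id):
    """Transpose pairs (PEi,j, PEj,i) whose members land in different clusters."""
    return [((i, j), (j, i))
            for i in range(N) for j in range(i + 1, N)
            if cluster_id(i, j) != cluster_id(j, i)]


print(split_transpose_pairs(south_east_cluster_id))  # no pair is split
print(split_transpose_pairs(north_east_cluster_id))  # includes PE0,1/PE1,0
```

Under these assumptions the South/East grouping keeps every transpose pair together because i + j is symmetric in i and j, while the North/East grouping keeps only pairs such as PE0,2/PE2,0 and PE1,3/PE3,1 together.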

Another clustering is depicted in FIG. 5F where clusters 
61, 63, 65, and 67 form a criss cross pattern in the tilings of 
the torus 40. This clustering demonstrates that there are a 
number of ways in which to group PEs to yield clusters 
which communicate with two other clusters in mutually 
exclusive directions. That is, PE0,0 and PE2,2 of cluster 65 
communicate to the East with PE0,1 and PE2,3, respectively, 
of cluster 61. Additionally, PE1,1 and PE3,3 of cluster 65 
communicate to the West with PE1,0 and PE3,2, respectively, 
of cluster 61. As will be described in greater detail below, the 
Easterly communications paths just described, that is, those 
between PE0,0 and PE0,1 and between PE2,2 and PE2,3, and 
other inter-cluster paths may be combined with mutually 
exclusive inter-cluster communications paths, through 
multiplexing for example, to reduce by half the number of 
interconnection wires required for inter-PE communications. 
The clustering of FIG. 5F also groups transpose 
elements within clusters. 

One aspect of the new array's scalability is demonstrated 
by FIG. 5G, where a 4x8 torus array is depicted as two 4x4 
arrays 40A and 40B. One could use the techniques described 
to this point to produce eight four-PE clusters from a 4x8 
torus array. In addition, by dividing the 4x8 torus into two 
4x4 toruses and combining respective clusters into clusters, 
that is, clusters 44A and 44B, 46A and 46B, and so on, for 
example, four eight-PE clusters with all the connectivity and 
transpose relationships of the 4x4 subclusters contained in 
the eight four-PE cluster configuration are obtained. This 
cluster combining approach is general and other scalings are 
possible. 

The presently preferred, but not sole, clustering process 
may also be described as follows. Given an NxN basic torus 
PE_{i,j}, where i=0, 1, 2, . . . N-1 and j=0, 1, 2, . . . N-1, the 
preferred, South- and East-communicating clusters may be 
formed by grouping PE_{i,j}, PE_{(i+1) mod N, (j+N-1) mod N}, 
PE_{(i+2) mod N, (j+N-2) mod N}, . . . , 
PE_{(i+N-1) mod N, (j+N-(N-1)) mod N}. 
This formula can be rewritten for an NxN torus array with 
N clusters of N PEs in which the cluster groupings can be 
formed by selecting an i and a j, and then using the formula: 
PE_{(i+a) mod N, (j+N-a) mod N} for any i, j and for all 
a ∈ {0, 1, . . . , N-1}. 
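
The grouping formula can be written out as a short sketch, assuming the index set a = 0, 1, . . . , N-1 so that each cluster holds N PEs. The names `cluster_of` and `all_clusters` are illustrative and do not come from the original text.

```python
def cluster_of(i, j, n):
    """PEs grouped with PEi,j: ((i + a) mod n, (j + n - a) mod n) for a = 0..n-1."""
    return [((i + a) % n, (j + n - a) % n) for a in range(n)]


def all_clusters(n):
    """One cluster per starting column in row 0; together they cover the torus."""
    return [cluster_of(0, j, n) for j in range(n)]


print(cluster_of(0, 0, 4))  # [(0, 0), (1, 3), (2, 2), (3, 1)]
for cluster in all_clusters(4):
    print(cluster)
```

Each step moves one row South (i + 1) and one column West (j + N - 1), so successive members lie along a wrapped anti-diagonal, which is why a PE and its transpose always fall in the same group.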

FIG. 6 illustrates the production of clusters 44 through 50 
beginning with PE1,3 and combining PEs which communicate 
to the South and East. In fact, the clusters 44 through 50, 
which are the clusters of the preferred embodiment of a 4x4 
torus equivalent of the new array, are obtained by combining 
South and East communicating PEs, regardless of what PE 
within the basic NxN torus 40 is used as a starting point. 
FIGS. 7 and 8 illustrate additional examples of the approach, 
using 3x3 and 3x5 toruses, respectively. 
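
The independence from the starting point can be illustrated for square toruses with a brief sketch. It assumes the South- and East-communicating grouping rule PE_{(i+a) mod N, (j+N-a) mod N} for a = 0, . . . , N-1; the helper names are hypothetical, and the non-square 3x5 case of FIG. 8 is not modeled here.

```python
def cluster_from(i, j, n):
    """Cluster obtained when the grouping is started at PEi,j (order ignored)."""
    return frozenset(((i + a) % n, (j + n - a) % n) for a in range(n))


def partition(n):
    """Distinct clusters found by trying every PE of the n x n torus as a start."""
    return {cluster_from(i, j, n) for i in range(n) for j in range(n)}


for n in (3, 4):
    parts = partition(n)
    print(n, "x", n, "torus:", len(parts), "clusters of size",
          sorted({len(c) for c in parts}))

# Starting at PE1,3 reproduces the cluster that contains PE0,0.
print(cluster_from(1, 3, 4) == cluster_from(0, 0, 4))
```

Trying all N^2 starting points yields only N distinct clusters, each of N PEs, so the choice of starting PE changes only which cluster is produced first.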

Another, equivalent way of viewing the cluster-building 
process is illustrated in FIG. 9. In this and similar figures that 
follow, wraparound wires are omitted from the figure for the 
sake of clarity. A conventional 4x4 torus is first twisted into 
a rhombus, as illustrated by the leftward shift of each row. 
This shift serves to group transpose PEs in "vertical slices" 
of the rhombus. To produce equal-size clusters the rhombus 
is, basically, formed into a cylinder. That is, the left-most, or 
western-most, vertical slice 80 is wrapped around to abut the 
eastern-most PE0,3 in its row. The vertical slice 82 to the east 
of slice 80 is wrapped around to abut PE0,0 and PE1,3, and 
the next eastward vertical slice 84 is wrapped around to abut 
PE0,1, PE1,0 and PE2,3. Although, for the sake of clarity, all 
connections are not shown, all connections remain the same 
as in the original 4x4 torus. The resulting vertical slices 
produce the clusters of the preferred embodiment 44 through 
50 shown in FIG. 5A, the same clusters produced in the 
manner illustrated in the discussion related to FIGS. 5A and 
6. In FIG. 10, the clusters created in the rhombus/cylinder 
process of FIG. 9 are "peeled open" for illustrative purposes 
to reveal the inter-cluster connections. For example, all 
inter-PE connections from cluster 44 to cluster 46 are to the 
South and East, as are those from cluster 46 to cluster 48 and 
from cluster 48 to cluster 50 and from cluster 50 to cluster 
44. This commonality of inter-cluster communications, in 
combination with the nature of inter-PE communications in 
a SIMD process, permits a significant reduction in the 
number of inter-PE connections. As discussed in greater 
detail in relation to FIGS. 16 and 17 below, mutually 
exclusive communications, e.g., communications to the 
South and East from cluster 44 to cluster 46, may be 
multiplexed onto a common set of interconnection wires 
running between the clusters. Consequently, the inter-PE 
connection wiring of the new array, hereinafter referred to as 
the "manifold array", may be substantially reduced, to one 
half the number of interconnection wires associated with a 
conventional nearest neighbor torus array. 

The cluster formation process used to produce a manifold 
array is symmetrical and the clusters formed by taking 
horizontal slices of a vertically shifted torus are the same as 
clusters formed by taking vertical slices of a horizontally 
shifted torus. FIGS. 11A and 11B illustrate the fact that the 
rhombus/cylinder technique may also be employed to produce 
the preferred clusters from horizontal slices of a 
vertically shifted torus. In FIG. 11A the columns of a 
conventional 4x4 torus array are shifted vertically to produce 
a rhombus and in FIG. 11B the rhombus is wrapped 
into a cylinder. Horizontal slices of the resulting cylinder 
provide the preferred clusters 44 through 50. Any of the 
techniques illustrated to this point may be employed to 
create clusters for manifold arrays which provide inter-PE 
connectivity equivalent to that of a conventional torus array, 
with substantially reduced inter-PE wiring requirements. 

As noted in the summary, the above clustering process is 
general and may be employed to produce manifold arrays of 
M clusters containing N PEs each from an NxM torus array. 
For example, the rhombus/cylinder approach to creating 
four clusters of five PEs, for a 5x4 torus array equivalent, is 
illustrated in FIG. 12. Note that the vertical slices which 
form the new PE clusters, for example, PE4,0, PE3,1, PE2,2, 
PE1,3, and PE0,0, maintain the transpose clustering relationship 
of the previously illustrated 4x4 array. Similarly, as 
illustrated in the diagram of FIG. 13, a 4x5 torus will yield 
five clusters of four PEs each with the transpose relationship 
only slightly modified from that obtained with a 4x4 torus. 
In fact, transpose PEs are still clustered together, only in a 
slightly different arrangement than with the 4x4 clustered 
array. For example, transpose pairs PE1,0/PE0,1 and PE2,3/ 
PE3,2 were grouped in the same cluster within the preferred 
4x4 manifold array, but they appear, still paired, but in 
separate clusters in the 4x5 manifold array of FIG. 13. As 
illustrated in the cluster-selection diagram of FIG. 14, the 
diagonal PEs, PEi,j where i=j, in an odd number by odd 
number array are distributed one per cluster. 

The block diagrams of FIGS. 15A-15D illustrate the 
inter-cluster connections of the new manifold array. To 
simplify the description, in the following discussion, 
unidirectional connection paths are assumed unless otherwise 
stated. Although, for the sake of clarity, the invention is 
described with parallel interconnection paths, or buses, 
represented by individual lines, bit-serial communications, 
in other words buses having a single line, are also contemplated 
by the invention. Where bus multiplexers or bus 
switches are used, the multiplexer and/or switches are replicated 
for the number of lines in the bus. Additionally, with 
appropriate network connections and microprocessor chip 
implementations of PEs, the new array may be employed 
with systems which allow dynamic switching between 
MIMD, SIMD and SISD modes, as described in U.S. Pat. 
No. 5,475,856 to P.M. Kogge, entitled, Dynamic Multi-Mode 
Parallel Processor Array Architecture, which is 
hereby incorporated by reference. 

In FIG. 15A, clusters 80, 82 and 84 are three PE clusters 
connected through cluster switches 86 and inter-cluster links 
88 to one another. To understand how the manifold array PEs 
connect to one another to create a particular topology, the 
connection view from a PE must be changed from that of a 
single PE to that of the PE as a member of a cluster of PEs. 
For a manifold array operating in a SIMD unidirectional 
communication environment, any PE requires only one 
transmit port and one receive port, independent of the 
number of connections between the PE and any of its 
directly attached neighborhood of PEs in the conventional 
torus. In general, for communication patterns that 
cause no conflicts between communicating PEs, only one 
transmit and one receive port are required per PE, independent 
of the number of neighborhood connections a particular 
topology may require of its PEs. 

Four clusters, 44 through 50, of four PEs each are 
combined in the array of FIG. 15B. Cluster switches 86 and 
communication paths 88 connect the clusters in a manner 
explained in greater detail in the discussion of FIGS. 16, 17, 
and 18 below. Similarly, five clusters, 90 through 98, of five 
PEs each are combined in the array of FIG. 15C. In practice, 
the clusters 90-98 are placed as appropriate to ease integrated 
circuit layout and to reduce the length of the longest 
inter-cluster connection. FIG. 15D illustrates a manifold 
array of six clusters, 99, 100, 101, 102, 104, and 106, having 
six PEs each. Since communication paths 86 in the new 
manifold array are between clusters, the wraparound connection 
problem of the conventional torus array is eliminated. 
That is, no matter how large the array becomes, no 
interconnection path need be longer than the basic inter-cluster 
spacing illustrated by the connection paths 88. This 
is in contrast to wraparound connections of conventional 
torus arrays which must span the entire array. 

The block diagram of FIG. 16 illustrates in greater detail 
a preferred embodiment of a four cluster, sixteen PE, manifold 
array. The clusters 44 through 50 are arranged, much as 
they would be in an integrated circuit layout, in a rectangle 
or square. The connection paths 88 and cluster switches are 
illustrated in greater detail in this figure. Connections to the 
South and East are multiplexed through the cluster switches 
86 in order to reduce the number of connection lines 
between PEs. For example, the South connection between 
PE1,2 and PE2,2 is carried over a connection path 110, as is 
the East connection from PE2,1 to PE2,2. As noted above, 
each connection path, such as the connection path 110, may 
be a bit-serial path and, consequently, may be effected in an 
integrated circuit implementation by a single metallization 
line. Additionally, the connection paths are only enabled 
when the respective control line is asserted. These control 
lines can be generated by the instruction decoder/controller 
38 of each PE 30, illustrated in FIG. 3A. Alternatively, these 
control lines can be generated by an independent instruction 
decoder/controller that is included in each cluster switch. 
Since there are multiple PEs per switch, the multiple enable 
signals generated by each PE are compared to make sure 
they have the same value in order to ensure that no error has 
occurred and that all PEs are operating synchronously. That 
is, there is a control line associated with each noted direction 
path, N for North, S for South, E for East, and W for West. 
The signals on these lines enable the multiplexer to pass data 
on the associated data path through the multiplexer to the 
connected PE. When the control signals are not asserted the 
associated data paths are not enabled and data is not transferred 
along those paths through the multiplexer. 

The block diagram of FIG. 17 illustrates in greater detail 
the interconnection paths 88 and switch clusters 86 which 
link the four clusters 44 through 50. In this figure, the West 
and North connections are added to the East and South 
connections illustrated in FIG. 16. Although, in this view, 
each processing element appears to have two input and two 
output ports, in the preferred embodiment another layer of 
multiplexing within the cluster switches brings the number 
of communications ports for each PE down to one for input 
and one for output. In a standard torus with four neighborhood 
transmit connections per PE and with unidirectional 
communications, that is, only one transmit direction enabled 
per PE, there are four multiplexer or gated circuit transmit 
paths required in each PE. A gated circuit may suitably 
include multiplexers, AND gates, tristate driver/receivers 
with enable and disable control signals, and other such 
interface enabling/disabling circuitry. This is due to the 
interconnection topology defined as part of the PE. The net 
result is that there are 4N^2 multiple transmit paths in the 
standard torus. In the manifold array, with equivalent connectivity 
and unlimited communications, only 2N^2 multiplexed 
or gated circuit transmit paths are required. This 
reduction of 2N^2 transmit paths translates into a significant 
savings in integrated circuit real estate area, as the area 
consumed by the multiplexers and 2N^2 transmit paths is 
significantly less than that consumed by 4N^2 transmit paths. 

A complete cluster switch 86 is illustrated in greater detail 
in the block diagram of FIG. 18. The North, South, East, and 
West outputs are as previously illustrated. Another layer of 
multiplexing 112 has been added to the cluster switch 86. 
This layer of multiplexing selects between East/South 
reception, labeled A, and North/West reception, labeled B, 
thereby reducing the communications port requirements of 
each PE to one receive port and one send port. Additionally, 
multiplexed connections between transpose PEs, PE1,3 and 
PE3,1, are effected through the intra-cluster transpose connections 
labeled T. When the T multiplexer enable signal for 
a particular multiplexer is asserted, communications from a 
transpose PE are received at the PE associated with the 
multiplexer. In the preferred embodiment, all clusters 
include transpose paths such as this between a PE and its 
transpose PE. These figures illustrate the overall connection 
scheme and are not intended to illustrate how a multi-layer 
integrated circuit implementation may accomplish the 
entirety of the routine array interconnections that would 
typically be made as a routine matter of design choice. As 
with any integrated circuit layout, the IC designer would 
analyze various tradeoffs in the process of laying out an 
actual IC implementation of an array in accordance with the 
present invention. For example, the cluster switch may be 
distributed within the PE cluster to reduce the wiring lengths 
of the numerous interfaces. 

To demonstrate the equivalence to a torus array's communication 
capabilities and the ability to execute an image 
processing algorithm on the Manifold Array, a simple 2D 
convolution using a 3x3 window, FIG. 19A, will be 
described below. The Lee and Aggarwal algorithm for 
convolution on a torus machine will be used. See, S. Y. Lee 
and J. K. Aggarwal, Parallel 2D Convolution on a Mesh 
Connected Array Processor, IEEE Transactions on Pattern 
Analysis and Machine Intelligence, Vol. PAMI-9, No. 4, pp. 
590-594, July 1987. The internal structure of a basic PE 30, 
FIG. 3A, is used to demonstrate the convolution as executed 
on a 4x4 Manifold Array with 16 of these PEs. For purposes 
of this example, the Instruction Decoder/Controller also 
provides the Cluster Switch multiplexer Enable signals. 
Since there are multiple PEs per switch, the multiple enable 
signals are compared to be equal to ensure no error has 
occurred and all PEs are operating in synchronism. Based 
upon the S. Y. Lee and J. K. Aggarwal algorithm for 
convolution, the Manifold array would desirably be the size 
of the image, for example, an NxN array for an NxN image. 
Due to implementation issues it must be assumed that the 
array is smaller than NxN for large N. Assuming the array 
size is CxC, the image processing can be partitioned into 
multiple CxC blocks, taking into account the image block 
overlap required by the convolution window size. Various 
techniques can be used to handle the edge effects of the NxN 
image. For example, pixel replication can be used that 
effectively generates an (N+1)x(N+1) array. It is noted that 
due to the simplicity of the processing required, a very small 
PE could be defined in an application specific implementation. 
Consequently, a large number of PEs could be placed 
in a Manifold Array organization on a chip thereby improving 
the efficiency of the convolution calculations for large 
image sizes. 

The convolution algorithm provides a simple means to 
demonstrate the functional equivalence of the Manifold 
Array organization to a torus array for North/East/South/ 
West nearest neighbor communication operations. 
Consequently, the example focuses on the communications 
aspects of the algorithm and, for simplicity of discussion, a 
very small 4x4 image size is used on a 4x4 Manifold array. 
Larger NxN images can be handled in this approach by 
loading a new 4x4 image segment into the array after each 
previous 4x4 block is finished. For the 4x4 array no wraparound 
is used and for the edge PEs 0's are received from the 
virtual PEs not present in the physical implementation. The 
processing for one 4x4 block of pixels will be covered in this 
operating example. 

To begin the convolution example, it is assumed that the 
PEs have already been initialized by a SIMD controller, such 
as controller 29 of FIG. 3A, and the initial 4x4 block of 
pixels has been loaded through the data bus to register R1 in 
each PE, in other words, one pixel per PE has been loaded. 
FIG. 19C shows a portion of an image with a 4x4 block to 
be loaded into the array. FIG. 19D shows this block loaded 
in the 4x4 torus logical positions. In addition, it is assumed 
that the accumulating sum register R0 in each PE has been 
initialized to zero. Though inconsequential to this algorithm, 
R2 has also been shown as initialized to zero. The convolution 
window elements are broadcast one at a time in each 
step of the algorithm. These window elements are received 
into register R2. The initial state of the machine prior to 
broadcasting the window elements is shown in FIG. 20A. 
The steps to calculate the sum of the weighted pixel values 
in a 3x3 neighborhood for all PEs follow. 

The algorithm begins with the transmission 
(broadcasting) of the first window element W00 to all PEs. 
Once this is received in each PE, the PEs calculate the first 
R0=R0+R2*R1 or R0=R0+W*P. The result of the calculation 
is then communicated to a nearest neighbor PE according 
to the convolution path chosen, FIG. 19B. For simplicity 
of discussion it is assumed that each operational step to be 
described can be partitioned into three substeps each controlled 
by instructions dispatched from the controller: a 
broadcast window element step, a computation step, and a 
communications step. It is noted that improvements to this 
simplified approach can be developed, such as, beginning 
with major step 2, overlapping the window element broadcast 
step with the communications of result step. These 
points are not essential to the purpose of this description and 
would be recognized by one of ordinary skill in the art. A 
superscript is used to represent the summation step value as 
the operation proceeds. As an aid for following the communications 
of the calculated values, a subscript on a label 
indicates the source PE that the value was generated in. The 
convolution path for pixel {i,j} is shown in FIG. 19B. FIGS. 
20-24 indicate the state of the Manifold Array after each 
computation step. 

In FIG. 20B, W00 is broadcast to the PEs and each PE 
calculates R0^1 = 0 + W00*R1 and communicates R0^1 to the 
South PE where the received R0^1 value is stored in the PEs' 
register R0. 

In FIG. 21A, W10 is broadcast to the PEs and each PE 
calculates R0^2 = R0^1 + W10*R1 and communicates R0^2 to the 
South PE where the received R0^2 value is stored in the PEs' 
register R0. 

In FIG. 21B, W20 is broadcast to the PEs and each PE 
calculates R0^3 = R0^2 + W20*R1 and communicates R0^3 to the 
East PE where the received R0^3 value is stored in the PEs' 
register R0. 

In FIG. 22A, W21 is broadcast to the PEs and each PE 
calculates R0^4 = R0^3 + W21*R1 and communicates R0^4 to the 
East PE where the received R0^4 value is stored in the PEs' 
register R0. 

In FIG. 22B, W22 is broadcast to the PEs and each PE 
calculates R0^5 = R0^4 + W22*R1 and communicates R0^5 to the 
North PE where the received R0^5 value is stored in the PEs' 
register R0. 

In FIG. 23A, W12 is broadcast to the PEs and each PE 
calculates R0^6 = R0^5 + W12*R1 and communicates R0^6 to the 
North PE where the received R0^6 value is stored in the PEs' 
register R0. 

In FIG. 23B, W02 is broadcast to the PEs and each PE 
calculates R0^7 = R0^6 + W02*R1 and communicates R0^7 to the 
West PE where the received R0^7 value is stored in the PEs' 
register R0. 

In FIG. 24A, W01 is broadcast to the PEs and each PE 
calculates R0^8 = R0^7 + W01*R1 and communicates R0^8 to the 
South PE where the received R0^8 value is stored in the PEs' 
register R0. 

In FIG. 24B, W11 is broadcast to the PEs and each PE 
calculates R0^9 = R0^8 + W11*R1 and End. 

At the end of the above nine steps each PEi,j contains 
(with reference to FIG. 19B): 

$C_{i,j} = W00\,P_{i-1,j-1} + W10\,P_{i,j-1} + W20\,P_{i+1,j-1} + W21\,P_{i+1,j} + W22\,P_{i+1,j+1} + W12\,P_{i,j+1} + W02\,P_{i-1,j+1} + W01\,P_{i-1,j} + W11\,P_{i,j}$ 

For example, for i=5 and j=6, C5,6 = W00P4,5 + W10P5,5 + 
W20P6,5 + W21P6,6 + W22P6,7 + W12P5,7 + W02P4,7 + 
W01P4,6 + W11P5,6. 

It is noted that at the completion of this example, given the 
operating assumptions, four valid convolution values have 
been calculated, namely the ones in PEs {(1,1), (1,2), (2,1), 
(2,2)}. This is due to the edge effects as discussed previously. 
Due to the simple nature of the PE needed for this 
algorithm, a large number of PEs can be incorporated on a 
chip, thereby greatly increasing the efficiency of the convolution 
calculation for large image sizes. 

The above example demonstrates that the Manifold Array 
is equivalent in its communications capabilities for the 
four communications directions (North, East, South, and 
West) of a standard torus while requiring only half the 
wiring expense of the standard torus. Given the Manifold 
Array's capability to communicate between transpose PEs, 
implemented with a regular connection pattern, minimum 
wire length, and minimum cost, the Manifold Array provides 
additional capabilities beyond the standard torus. Since the 
Manifold Array organization is more regular as it is made up 
of the same size clusters of PEs while still providing the 
communications capabilities of transpose and neighborhood 
communications, it represents a superior design to the standard 
and diagonal fold toruses of the prior art. 

The foregoing description of specific embodiments of the 
invention has been presented for the purposes of illustration 
and description. It is not intended to be exhaustive or to limit 
the invention to the precise forms disclosed, and many 
modifications and variations are possible in light of the 
above teachings. The embodiments were chosen and 
described in order to best explain the principles of the 
invention and its practical application, to thereby enable 
others skilled in the art to best utilize the invention. It is 
intended that the scope of the invention be limited only by 
the claims appended hereto. 

We claim: 

1. An array processor, comprising: 

N clusters wherein each cluster contains M processing 
elements, each processing element having a communications 
port through which the processing element 
transmits and receives data over a total of B wires; 

communications paths which are less than or equal to 
(M)(B)-wires wide connected between pairs of said 
clusters; each cluster member in the pair containing 
processing elements which are torus nearest neighbors 
to processing elements in the other cluster of the pair, 
each path permitting communications between said 
cluster pairs in two mutually exclusive torus directions, 
that is, South and East or South and West or North and 
East or North and West; and 

multiplexers connected to combine 2(M)(B)-wire wide 
communications into said less than or equal to (M)(B)- 
wires wide paths between said cluster pairs. 

2. The array processor of claim 1, wherein the processing 
elements of each cluster communicate to the North and West 
torus directions with one cluster and to the South and East 
torus directions with another cluster. 

3. The array processor of claim 1, wherein the processing 
elements of each cluster communicate to the North and East 
torus directions with one cluster and to the South and West 
torus directions with another cluster. 

4. The array processor of claim 1, wherein at least one 
cluster includes an NxN torus transpose pair. 

5. The array processor of claim 1, wherein a cluster switch 
comprises said multiplexers and said cluster switch is connected 
to multiplex communications received from two 
mutually exclusive torus directions to processing elements 
within a cluster. 

6. The array processor of claim 5, wherein said cluster 
switch is connected to multiplex communications from the 
processing elements within a cluster for transmission to 
another cluster. 

7. The array processor of claim 6, wherein said cluster 
switch is connected to multiplex communications between 
transpose processing elements within a cluster. 

8. The array processor of claim 1, wherein N is greater 
than or equal to M. 

9. The array processor of claim 1, wherein N is less than 
M. 

10. An array processor, comprising: 

N clusters wherein each cluster contains M processing 
elements, each processing element having a communications 
port through which the processing element 
transmits and receives data over a total of B wires and 
each processing element within a cluster being formed 
in closer physical proximity to other processing elements 
within a cluster than to processing elements 
outside the cluster; 

communications paths which are less than or equal to 
(M)(B)-wires wide connected between pairs of said 
clusters, each cluster member in the pair containing 
processing elements which are torus nearest neighbors 
to processing elements in the other cluster of the pair, 
each path permitting communications between said 
cluster pairs in two mutually exclusive torus directions, 
that is, South and East or South and West or North and 
East or North and West; and 

multiplexers connected to combine 2(M)(B)-wire wide 
communications into said less than or equal to (M)(B)- 
wires wide paths between said cluster pairs. 

11. The array processor of claim 10, wherein the processing 
elements of each cluster communicate to the North and 
West torus directions with one cluster and to the South and 
East torus directions with another cluster. 

12. The array processor of claim 10, wherein the processing 
elements of each cluster communicate to the North and 
East torus directions with one cluster and to the South and 
West torus directions with another cluster. 

13. The array processor of claim 10, wherein at least one 
cluster includes an NxN torus transpose pair. 

14. The array processor of claim 10, wherein a cluster 
switch comprises said multiplexers and said cluster switch is 
connected to multiplex communications received from two 
mutually exclusive torus directions to processing elements 
within a cluster. 

15. The array processor of claim 14, wherein said cluster 
switch is connected to multiplex communications from the 
processing elements within a cluster for transmission to 
another cluster. 

16. The array processor of claim 15, wherein said cluster 
switch is connected to multiplex communications between 
transpose processing elements within a cluster. 

17. The array processor of claim 10, wherein N is less than 
or equal to M. 

18. The array processor of claim 10, wherein N is greater 
than M. 

19. The array processor of claim 10, wherein communications 
between processing elements is bit-serial and each 
processing element cluster communicates with two other 
clusters over said communications paths. 

20. The array processor of claim 10, wherein the communications 
paths between processing elements comprise a 
data bus. 

21. The array processor of claim 10, wherein said com- 
munications paths are bidirectional paths. 

22. The array processor of claim 10, wherein said com- 
munications paths comprise unidirectional signal wires. 

23. The array processor of claim 10, wherein P and Q are 
the number of rows and columns, respectively, of a torus 
connected array having the same number of PEs as said 
array, and P and Q are equal to N and M, respectively. 

24. The array processor of claim 10, wherein P and Q are 
the number of rows and columns, respectively, of a torus 
connected array having the same number of PEs and P and 
Q are equal to M and N, respectively. 

25. An array processor, comprising: 

processing elements (PEs) PEi,j, where i and j refer to the 
respective row and column PE positions within a 
conventional torus-connected array, and where 
i=0, 1, 2, . . . N-1 and j=0, 1, 2, . . . N-1, said PEs 
arranged in clusters $PE_{(i+a)\,\mathrm{Mod}\,N,\ (j+N-a)\,\mathrm{Mod}\,N}$, for any 
i,j and for all a ∈ {0, 1, . . . , N-1}, wherein each cluster 
contains an equal number of PEs; and 

cluster switches connected to multiplex inter-PE commu- 
nications paths between said clusters thereby providing 
inter-PE connectivity equivalent to that of a torus- 
connected array. 

26. The array processor of claim 25, wherein said cluster 
switches are further connected to provide direct communi- 
cations between PEs in a transpose PE pair within a cluster. 

27. The array processor of claim 25, wherein said clusters 
are scaleable. 

28. A method of forming an array processor, comprising 
the steps of: 

arranging processing elements in N clusters wherein each 
cluster contains M processing elements, such that each 
cluster includes processing elements which communi- 
cate only in mutually exclusive torus directions with 
the processing elements of at least one other cluster; 
and 

multiplexing said mutually exclusive torus direction com- 
munications. 

***** 